Computational method of the cardiovascular diseases classification based on a generalized nonlinear canonical decomposition of random sequences

Decision support systems can substantially help medical doctors diagnose different diseases, especially in complicated cases. This article is devoted to recognizing and diagnosing heart disease based on automatic computer processing of patients' electrocardiograms (ECG). In the general case, the change of the ECG parameters can be presented as a random sequence of the signals under processing. Developing new computational methods for such signal processing is an important research problem in creating efficient medical decision support systems. The authors consider the possibility of increasing the diagnostic accuracy for cardiovascular diseases by implementing the proposed computational method of information processing. This method is based on the generalized nonlinear canonical decomposition of a random sequence of the change of cardiogram parameters. The use of a nonlinear canonical model makes it possible to significantly simplify the maximum likelihood criterion for classifying diseases. This simplification is provided by the transition from a multi-dimensional distribution density of cardiogram parameters to a product of one-dimensional distribution densities of independent random coefficients of a nonlinear canonical decomposition. The absence of any restrictions on the class of random sequences under study makes it possible to achieve maximum accuracy in diagnosing cardiovascular diseases. Functional diagrams for implementing the proposed method, reflecting the features of its application, are presented. The quantitative parameters of the core of the computational diagnostic procedure can be determined in advance based on the preliminary statistical data of the ECGs for different heart diseases. That is why the developed method is quite simple in terms of computation (computing complexity, accuracy, computing time, etc.) and can be implemented in medical computer decision systems for monitoring cardiovascular diseases and for their diagnosis in real time. The results of the numerical experiment confirm the high accuracy of the developed method for classifying cardiovascular diseases.

At present, the analysis process, as a rule, is a study of isopotential and other maps generated from the data obtained using the software supplied with the signal recording apparatus.
In medical practice, cardiologists' conclusions about patients' diagnoses are, as a rule, qualitative or verbal and are not always supported by a sufficient amount of quantitative data. In special or difficult cases of disease recognition, diagnostic errors by young or insufficiently experienced medical doctors are possible, and the real diagnostic process may be significantly prolonged before a final correct decision about the true diagnosis is reached.
Decision support systems can substantially help medical doctors in decision-making about the diagnosis of different heart diseases, especially in complicated cases. The most promising approach is based on recognizing and diagnosing heart disease using automatic computer processing of patients' electrocardiograms (ECG).
In the general case, the change of the ECG parameters can be presented as a random sequence of the signals under processing. Developing new computational methods for such signal processing is an important research problem in creating efficient medical decision support systems.
In this article, the authors consider the possibility of increasing the diagnostic accuracy of cardiovascular diseases by implementing the computational method, which is based on the generalized nonlinear canonical decomposition of a random sequence of the change of cardiogram parameters. The absence of any restrictions on the class of random sequences under study makes it possible to achieve maximum accuracy in diagnosing cardiovascular diseases. The main advantage is that quantitative parameters of the computational diagnostic procedure can be determined in advance based on the preliminary statistical data of the ECGs for different heart diseases. That is why the developed method is quite simple in terms of computation (computing complexity, accuracy, computing time, etc.) and can be implemented in medical computer decision systems for monitoring cardiovascular diseases and for their diagnosis in real time.
Thus, the development of efficient mathematical models and computational methods for identifying the individual characteristics of an electrocardiogram with high accuracy (with subsequent classification), as well as the creation of an automated computer diagnostic support system, is an urgent and important task in multidisciplinary "medicine-computer science" research.
The rest of the article is organized as follows. The "Background and analysis of the related works" section analyzes related works in the field of ECG processing. The "Problem statement" section formulates the problem. The "Solution" section develops the computational method and the corresponding mathematical model based on the generalized nonlinear canonical decomposition of a random sequence of the change of cardiogram parameters. The "Results of the numerical experiment" section presents the modeling and simulation results for different existing methods and compares them with the proposed computational method. The paper ends with the "Conclusion" section.

Background and analysis of the related works
Diagnostics of electrocardiograms consists of three successive stages: (a) preliminary processing, (b) feature extraction (normalization), and (c) classification. Let us analyze these stages in turn.
First stage: preprocessing. Preprocessing reduces signal measurement noise by smoothing the electrocardiogram signal and suppressing drift and baseline deviation. The most common methods used to reduce signal noise are (a) second-order low-pass and (b) high-pass Butterworth filters 10, (c) the Daubechies wavelet 11 and (d) the orthogonal wavelet filter 12. In addition, techniques such as the median filter, the linear-phase high-pass filter and the mean-median filter are used for baseline adjustment.
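As an illustration of the baseline adjustment mentioned above, the following is a minimal sketch of median-filter baseline removal in plain Python. The function names, the window length and the toy signal are assumptions made for this example; they are not taken from the cited works.

```python
from statistics import median

def median_filter(signal, window):
    """Sliding-window median; the window shrinks at the edges of the signal."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [
        median(signal[max(0, i - half): i + half + 1])
        for i in range(len(signal))
    ]

def remove_baseline(signal, window):
    """Subtract a median-filter estimate of the baseline from the signal."""
    baseline = median_filter(signal, window)
    return [s - b for s, b in zip(signal, baseline)]

# Toy trace (assumed data): slow linear drift plus one narrow spike
# standing in for an R peak.
drift = [0.01 * n for n in range(50)]
ecg = list(drift)
ecg[25] += 1.0
corrected = remove_baseline(ecg, window=11)
```

The window has to be wider than the narrow peaks that should survive and narrower than the time scale of the drift; here the spike is almost fully preserved while the drift is removed away from the signal edges.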
Second stage: feature extraction. Feature extraction is an interactive process that includes a series of automatic data transformation procedures. When a large number of measured features describe the characteristics of the input signal, correlation and factor analysis of the data can be used to reduce the dimension of the problem. According to the extraction and analysis methods, the features can be divided into the following categories:
• Temporal features 13 (these features are described in the time domain and represent amplitude, slope and heart rate);
• Spectral features 14,15 (these features are defined in the frequency domain and account for spectral concentration and normalized spectral moments);
• Time-frequency/wavelet features 16,17 (features extracted from the results of the wavelet transform applied to the electrocardiogram signal);
• Features of the complexity of geometric distortions 18 (these features include various calculations related to the complexity of the considered segment of the electrocardiogram).
The feature extraction stage ends with the optimization of the number of features, which makes it possible to remove redundant features, reduce computational costs and improve the overall performance of the system. This step uses the following three main categories of feature selection methods: (a) wrapper methods (recursive feature elimination 14; direct feature selection 19; genetic algorithms 20); (b) filter methods (correlation 21; Chi-Square 15; analysis of variance (ANOVA) 22; ReliefF 23); (c) built-in methods 24.
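To make the filter category concrete, here is a small sketch of correlation-based feature ranking. The feature names and values are invented for the example, and ranking by the absolute Pearson correlation with the class label is only one simple variant of a filter method.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def select_by_correlation(features, labels, top_k):
    """Return the top_k feature names ranked by |correlation with the label|."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical data: six recordings, label 0 = normal, 1 = abnormal.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "rr_interval": [0.80, 0.82, 0.79, 0.60, 0.62, 0.58],
    "qt_interval": [0.40, 0.36, 0.41, 0.39, 0.42, 0.37],
    "t_height": [0.30, 0.28, 0.31, 0.12, 0.15, 0.10],
}
selected = select_by_correlation(features, labels, top_k=2)
```

In this toy set the QT column barely differs between the two groups, so it is the feature that gets dropped.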
To date, a large number of approaches have been developed to solve the problem of diagnosing cardiovascular diseases using various mathematical methods.
Third stage: classification. The classification process can be carried out in several iterations, depending on the chosen recognition scheme. In some cases, the results obtained at this stage require a revision of the entire processing scheme. The most common classification methods are: the discriminant function 25-28; the support vector machine 29; k-Nearest Neighbors (k-NN) 30; Decision Trees (DT) 31. Several other approaches to ECG analysis are based on chaos theory 32, a combination of statistical, geometric, and nonlinear heart rate variability features 32, and a semantic web ontology and heart failure expert system 33. At the same time, each of the above-mentioned methods has its drawbacks and limitations. That is why the need to develop new effective methods of medical diagnostics has not lost its relevance.
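For orientation, one of the listed classifiers, k-Nearest Neighbors, can be sketched in a few lines. The two-dimensional feature vectors and the labels below are hypothetical; the sketch uses Euclidean distance and a plain majority vote, which is only one of several possible k-NN variants.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label); majority vote among k nearest."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (RR interval, T-wave height) -> label.
train = [
    ([0.80, 0.30], "normal"), ([0.82, 0.28], "normal"), ([0.79, 0.31], "normal"),
    ([0.60, 0.12], "abnormal"), ([0.62, 0.15], "abnormal"), ([0.58, 0.10], "abnormal"),
]
predicted = knn_classify(train, [0.61, 0.13], k=3)
```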
Thus, the change in the values of the electrocardiogram has a stochastic character; therefore, for the diagnosis of cardiovascular diseases, it is necessary to use methods for recognizing random functions and random sequences.
The main method for recognizing realizations of random sequences is the Bayes decision rule, according to which a realization is assigned to the class with the maximum posterior probability. The method is theoretically exact; however, like many of its modifications (the Neyman-Pearson criterion, the Wald criterion, etc. 39), it applies only when the stochastic properties of the classes of random sequences are fully known. If the prior probabilities of the classes of random sequences are not known, equal values are assigned to them, and the decision rule becomes the maximum likelihood criterion. This criterion is especially important in recognition problems in which unlikely events cannot be excluded from consideration (diagnostics of emergency states of technical systems, medical diagnostics, etc.). However, for the maximum likelihood method, as well as for the Bayes rule, the problem of approximating the multivariate distribution density of a random sequence with a large number of sampling points remains unresolved.
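The relation between the two rules can be shown with a short sketch. The likelihood values, class names and priors are invented for illustration; with no priors supplied the function assumes equal priors, which is exactly the maximum likelihood criterion.

```python
def bayes_decide(likelihoods, priors=None):
    """Return (best_class, posteriors) from class-conditional likelihoods f(c/k)."""
    classes = list(likelihoods)
    if priors is None:
        # Unknown priors: assign equal values, i.e. maximum likelihood.
        priors = {k: 1.0 / len(classes) for k in classes}
    joint = {k: likelihoods[k] * priors[k] for k in classes}
    evidence = sum(joint.values())
    posteriors = {k: joint[k] / evidence for k in classes}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical likelihoods of one observed realization under two classes.
likelihoods = {"normal": 0.02, "disease_a": 0.05}
ml_class, _ = bayes_decide(likelihoods)
map_class, post = bayes_decide(likelihoods, {"normal": 0.9, "disease_a": 0.1})
```

With equal priors the realization is assigned to the rarer class because it explains the data better, whereas a strong prior in favour of the normal state overrides that evidence. This is why the maximum likelihood form is attractive when unlikely events must not be excluded.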
The use of the canonical expansion of Pugachev 40 makes it possible to move the decision rule from a multivariate distribution density to a product of one-dimensional distribution densities of uncorrelated random coefficients. However, this approach is valid only for Gaussian random sequences.
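For readers unfamiliar with the linear canonical expansion, a compact sketch follows. It uses the standard form X(i) = m(i) + sum over v <= i of V_v * phi_v(i) for a centered sequence with a known covariance matrix; this is the classical linear expansion, not the generalized nonlinear decomposition developed in this article. The covariance matrix and the realization are toy values, and the matrix is assumed to be positive definite so that no variance D_v is zero.

```python
def canonical_decomposition(R):
    """Variances D[v] and coordinate functions phi[v][i] from covariance matrix R."""
    n = len(R)
    D = [0.0] * n
    phi = [[0.0] * n for _ in range(n)]
    for v in range(n):
        D[v] = R[v][v] - sum(D[j] * phi[j][v] ** 2 for j in range(v))
        for i in range(v, n):
            acc = sum(D[j] * phi[j][v] * phi[j][i] for j in range(v))
            phi[v][i] = (R[v][i] - acc) / D[v]
    return D, phi

def coefficients(x_centered, phi):
    """Uncorrelated coefficients V[v] of one centered realization."""
    n = len(x_centered)
    V = [0.0] * n
    for v in range(n):
        V[v] = x_centered[v] - sum(V[j] * phi[j][v] for j in range(v))
    return V

def reconstruct(V, phi):
    """Rebuild the centered realization from its coefficients."""
    n = len(V)
    return [sum(V[v] * phi[v][i] for v in range(i + 1)) for i in range(n)]

# Toy covariance matrix and centered realization (assumed values).
R = [[4.0, 2.0, 1.0],
     [2.0, 3.0, 1.5],
     [1.0, 1.5, 2.0]]
D, phi = canonical_decomposition(R)
x = [1.0, -0.5, 2.0]
V = coefficients(x, phi)
x_back = reconstruct(V, phi)
```

The coefficients are uncorrelated by construction, but they are independent only in the Gaussian case, which is the limitation noted above.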
The aim of our study is to eliminate the abovementioned disadvantage and develop a diagnostic method that takes into account non-linear stochastic features of changes in cardiogram parameters.

Problem statement
The accumulated volume of statistical data on various cardiovascular diseases makes it possible to determine with high accuracy the characteristics of the random sequences $C(i/k)$, $i = \overline{1, I}$, $k = \overline{1, K}$, describing the changes in the values of the informative features of an electrocardiogram ($K$ includes $(K - 1)$ different diseases and one normal state of a person; $I$ is the number of informative features of an electrocardiogram). For example, with $i = \overline{1, I}$ in each case: $C(i/1)$: Marfan syndrome; $C(i/2)$: pulmonary embolism; $C(i/3)$: heart attack; $C(i/4)$: cardiomyopathy; $C(i/5)$: pericardial disease; $C(i/6)$: rheumatic heart disease; $C(i/7)$: stroke; $C(i/8)$: normal state of a person; in this case $K = 8$.
As a result of electrocardiography, a certain sequence of values $c(i)$, $i = \overline{1, I}$, can be obtained. It is necessary to determine to which class $k^* \in \{1, \ldots, K\}$ this realization $c(i)$, $i = \overline{1, I}$, belongs.

Solution
Taking into account the specific features of the problem of diagnosing cardiovascular diseases, to analyze the cardiogram, as a rule 41 [...] the interval RR; $C(9)$: the height of the T wave; $C(10)$: the interval QT; $C(11)$: the interval ST; $C(12)$: the interval TP; $C(13)$: the width of the U wave; $C(14)$: the height of the U wave.
A universal method for recognizing a random sequence is the maximum likelihood criterion 43,44, according to which the decision that the realization $\vec{c} = \{c(1), \ldots, c(14)\}$ belongs to the class $k^* \in \{1, \ldots, K\}$ is made when the following condition is met 45,46:

$$f_I(\vec{c}/k^*) = \max_{k} f_I(\vec{c}/k), \quad k = \overline{1, K}, \qquad (1)$$

where $f_I(\vec{c}/k)$, $k = \overline{1, K}$, $I = 14$, is the conditional distribution density of the features $\vec{c}$, provided that the realization belongs to class $k$.
Thus, to solve the recognition problem, it is necessary to estimate the unknown densities $f_I(\vec{c}/k)$, $k = \overline{1, K}$, which, given the rather large dimension ($I = 14$) of the function, is a complex and time-consuming procedure. Under the simplifying assumption that the random sequence contains only stochastic relations between two arbitrary parameters, the problem is greatly simplified: the transition from the sequence $C(i)$, $i = \overline{1, 14}$, to the analysis of a set of independent random coefficients $P_i/k$, $k = \overline{1, K}$, $I = 14$, allows us to write the decision rule in the following form [...]. The problem of recognition, therefore, is reduced to a successive approximation of twelve one-dimensional distribution densities.
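The reduction to one-dimensional densities can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the article: each one-dimensional density is approximated by a Gaussian kernel estimate with a fixed bandwidth, the coefficient vectors are toy two-dimensional data, and the product of densities is evaluated as a sum of logarithms to avoid numerical underflow.

```python
from math import exp, log, pi, sqrt

def kde(sample, h):
    """One-dimensional Gaussian kernel density estimate with bandwidth h."""
    norm = 1.0 / (len(sample) * h * sqrt(2.0 * pi))
    return lambda x: norm * sum(exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

def fit(training, h=0.5):
    """training: class -> list of coefficient vectors; one density per coefficient."""
    return {
        k: [kde([vec[i] for vec in vecs], h) for i in range(len(vecs[0]))]
        for k, vecs in training.items()
    }

def classify(p, densities, floor=1e-300):
    """Pick the class maximising the sum of log one-dimensional densities."""
    scores = {
        k: sum(log(max(f(x), floor)) for f, x in zip(fs, p))
        for k, fs in densities.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical coefficient vectors for two classes.
training = {
    "normal": [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]],
    "disease": [[3.0, 2.9], [3.1, 3.2], [2.8, 3.0]],
}
densities = fit(training)
best, scores = classify([0.1, 0.0], densities)
```

Only one-dimensional samples have to be stored and smoothed per class, which is the practical gain over estimating a single fourteen-dimensional density.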
The decision rule (1) is significantly simplified; however, the transition from the vector $\vec{c}$ to the vector $\vec{p}$ is possible only provided that the random sequence $C(i)$, $i = \overline{1, 14}$, has only the stochastic relations $E[C^{\nu}(j) C^{\mu}(i)]$.
To eliminate all existing probabilistic relationships $E[C^{\xi_g}(i - r_{g-1}) \cdots C^{\xi_2}(i - r_1) C^{\xi_1}(i)]$, we introduce into consideration the array of random variables [...] can be obtained by determining the cross-correlation of the elements of the array $C$. Consider this array as a vector random sequence $\vec{C}$, each component of which corresponds to a row of the array $C$. Applying the vector linear canonical decomposition to $\vec{C}$ gives the following expression for the first component of (7).
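The mixed moment functions that appear here are estimated from accumulated realizations. A minimal sketch of such a sample estimate is given below; the representation of a moment as a list of (lag, power) pairs, the function name and the toy data are assumptions of this example, and indices are 0-based.

```python
def mixed_moment(realizations, i, terms):
    """Sample estimate of E[prod over terms of C^power(i - lag)].

    terms is a list of (lag, power) pairs; lag 0 refers to C(i) itself.
    """
    if any(lag < 0 or i - lag < 0 for lag, _ in terms):
        raise ValueError("each lag must satisfy 0 <= lag <= i")
    total = 0.0
    for c in realizations:
        prod = 1.0
        for lag, power in terms:
            prod *= c[i - lag] ** power
        total += prod
    return total / len(realizations)

# Two toy realizations of a three-point sequence (assumed data).
realizations = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0]]
m_a = mixed_moment(realizations, 2, [(0, 1), (1, 2)])  # E[C(2) * C^2(1)]
m_b = mixed_moment(realizations, 2, [(0, 1), (2, 1)])  # E[C(2) * C(0)]
```

With two terms this reduces to the pairwise relations $E[C^{\nu}(j) C^{\mu}(i)]$; longer term lists give the higher-order relations.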
The computational method for diagnosing cardiovascular diseases consists of the following stages: (1) estimation of the moment functions $E[C^{\xi_g}(i - r_{g-1}) \cdots C^{\xi_2}(i - r_1) C^{\xi_1}(i)]$, [...]
The diagram of the functioning of the system for diagnosing cardiovascular diseases based on the developed method is shown in Fig. 3.
For the numerical experiment, two hundred different cardiograms for each disease, $\{C(i)/k\}$, $i = \overline{1, 14}$, $k = \overline{1, K}$, $K = 9$, were taken from the Physikalisch-Technische Bundesanstalt (PTB) dataset, a publicly available database compiled by the National Metrology Institute of Germany. It contains digitized ECG recordings of both normal and abnormal subjects, which are provided for research 52.

[Figure labels: Random coefficients; Coordinate functions]
Input parameters: $p_1$: increase of the double product per kilogram of the patient's body weight; $p_2$: increase of the double product per kilogram of physical exertion; $p_3$: coefficient of phosphorylation; $p_4$: age of the patient; $p_5$: double product of pulse and arterial pressure; $p_6$: adenosine triphosphoric acid; $p_7$: adenosine diphosphoric acid; $p_8$: adenylic acid; $p_9$: coefficient of the ratio of lactic and pyruvic acid content; $p_{10}$: maximal consumption of oxygen per kilogram of the patient's body weight; $p_{11}$: increase of the double product in response to submaximal physical exertion; $p_{12}$: tolerance to physical activity.
Expressions [...] Table 1, the diagnosed disease is stenocardia of the third functional class.
The value H for parameter p(1) is accepted in three cases (rule base, Table 2) [...]
(iv) Neural network. A Daubechies wavelet function of the fourth order and the Levenberg-Marquardt learning algorithm were used 59.
Expressions for determining the approximation and detail coefficients of the discrete wavelet transform are presented in the form: [...]
Tables 4, 5 indicate the low efficiency of the linear criterion (12) (the minimum amount of a priori information is used: $E[C(j) C(i)]$). The use of additional stochastic relations ($E[C^{\nu}(j) C^{\mu}(i)]$) in the criterion (13) makes it possible to increase the accuracy of diagnosing cardiovascular diseases compared to (12). The maximum accuracy of diagnostics is achieved by applying the proposed decision rule (11), which makes maximum use of the stochastic properties ($E[C^{\xi_l}(i - r_{l-1}) \cdots C^{\xi_2}(i - r_1) C^{\xi_1}(i)]$) of the studied random sequences. The existing set of wavelet functions and the lack of a rigorous mathematical apparatus for analyzing fuzzy equations significantly limit the quality of decision-making about a cardiovascular disease based on the fuzzy logic method and the neural network.

Conclusion
A computational method for computer systems for the automated diagnosis of cardiovascular diseases, based on a generalized nonlinear canonical decomposition of a random sequence of changes in cardiogram parameters, has been obtained. The use of the canonical model made it possible to form the maximum likelihood decision rule as a product of one-dimensional distribution densities of random coefficients. The canonical decomposition does not impose any significant restrictions (linearity, stationarity, Markov property, monotonicity, ergodicity, etc.) on the class of random sequences under study, which makes it possible to take maximum account of the stochastic characteristics of sequences related to various cardiovascular diseases. Owing to the recurrent structure of the calculations, the diagnostic method is quite simple in terms of computation and allows an arbitrary number of input parameters to be used. A significant advantage of the method is the ability to use characteristics not directly related to the cardiogram (age of the patient, blood pressure, etc.).
During the operation of the diagnostic system based on the proposed computational method, new diseases unknown to medicine can be identified in the case of a significant difference in the values of the likelihood function for the investigated cardiogram and the classified cardiograms of known diseases.
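One simple way to operationalize this idea is sketched below. It is an assumption-laden illustration rather than the article's procedure: a realization is flagged as not matching any known class when its best log-likelihood falls below the lowest value seen on the training cardiograms of that class, minus a margin. The class names, scores and margin are hypothetical.

```python
def class_thresholds(training_scores, margin=0.0):
    """Lowest training log-likelihood per class, minus a safety margin."""
    return {k: min(v) - margin for k, v in training_scores.items()}

def diagnose(log_likelihoods, thresholds):
    """Best class, or None when even the best class fits unusually poorly."""
    best = max(log_likelihoods, key=log_likelihoods.get)
    return best if log_likelihoods[best] >= thresholds[best] else None

# Hypothetical log-likelihoods of training cardiograms under their own class.
training_scores = {"normal": [-3.1, -2.7, -3.5], "class_a": [-4.0, -3.2, -3.8]}
thresholds = class_thresholds(training_scores, margin=1.0)
known = diagnose({"normal": -3.0, "class_a": -9.0}, thresholds)
unknown = diagnose({"normal": -25.0, "class_a": -30.0}, thresholds)
```

A None result does not establish a new disease by itself; it only marks the cardiogram for closer examination.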
The results of the numerical experiment indicate a high reliability of the diagnostics of cardiovascular diseases based on the proposed method.

Data availability
The datasets generated and analysed during the current study are available in the Physikalisch-Technische Bundesanstalt (PTB) repository, https://physionet.org/physiobank/database/.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.